OpenCores
x264 in or1ksim
by julius on Oct 27, 2009
julius
Posts: 363
Joined: Jul 1, 2008
Last seen: May 17, 2021

Hi guys,

I've got a version of x264 now running in or1ksim. This setup compiles in about 30 frames of CIF YUV into the ELF and encodes it, saving the data to memory. The application is built stand alone.

I have had to enable single-precision floating point to get this to run at any meaningful rate in or1ksim. This required patching the current GCC port for OpenRISC, as well as or1ksim, to fix up floating point support.

I have also put together a patch for x264 which allows it to be compiled and run in or1ksim once the updated toolchain and or1ksim are built. Unfortunately I haven't had a chance to test this on hardware yet because I don't have a build of an OpenRISC system with FPU available to me... yet.

I'll go through the steps required to get this up and running. Unfortunately it does require a re-build of gcc and or1ksim. I'll try to make it as painless as possible.

Building toolchain with newlib and floating point support

You'll want to make a directory to work in, say under your home path called or32-build or something. We'll also install this new version of the toolchain to the path /opt/or32-newlib which you'll have to create and chmod a+rwx to allow normal users to write to it.

Binutils

Download the binutils sources, extract them, download the patch, apply it, create a build directory, then configure, build and install binutils.

user@host:~/or32-build$ wget ftp://ocuser:oc@orsoc.se/toolchain/binutils-2.18.50.tar.bz2
user@host:~/or32-build$ tar xjf binutils-2.18.50.tar.bz2
user@host:~/or32-build$ wget ftp://ocuser:oc@orsoc.se/toolchain/binutils-2.18.50.or32_fixed_patch-v2.1.bz2
user@host:~/or32-build$ (cd binutils-2.18.50 && bzcat -dc ../binutils-2.18.50.or32_fixed_patch-v2.1.bz2 | patch -p1)
user@host:~/or32-build$ mkdir b-bu && cd b-bu
user@host:~/or32-build/b-bu$ ../binutils-2.18.50/configure --target=or32-elf --prefix=/opt/or32-newlib --disable-checking
user@host:~/or32-build/b-bu$ make all install

Paths

If you have another install of the OpenRISC toolchain and it lives in /opt/or32-elf/bin, then I would suggest moving /opt/or32-elf to /opt/or32-uclibc, or something appropriate, and creating a symbolic link to whichever version of the toolchain you wish to use.

We are building the current toolchain in /opt/or32-newlib, and we've just installed binutils there. We'll now create a symbolic link from /opt/or32-elf to this path, allowing us to put just the single path /opt/or32-elf/bin in our PATH variable.

If /opt/or32-elf already exists, move it out of the way first:

user@host:~/or32-build$ sudo mv /opt/or32-elf /opt/or32-uclibc

Then create the symbolic link:

user@host:~/or32-build$ sudo ln -s /opt/or32-newlib /opt/or32-elf

And add the following to the ~/.bashrc file:

PATH=$PATH:/opt/or32-elf/bin

GCC and newlib

We use newlib here instead of uClibc because it's easier to make standalone apps based on newlib.

You'll want to download both GCC and newlib sources, patch them, symlink newlib into GCC's directory, make a build directory, configure, make and install. This section essentially follows the same instructions as outlined on the OpenRISC GNU toolchain newlib install guide.

user@host:~/or32-build$ wget ftp://ftp.gnu.org/gnu/gcc/gcc-4.2.2/gcc-core-4.2.2.tar.bz2
user@host:~/or32-build$ tar xjf gcc-core-4.2.2.tar.bz2
user@host:~/or32-build$ wget ftp://ocuser:oc@orsoc.se/toolchain/gcc-4.2.2-or32-fp.patch.bz2
user@host:~/or32-build$ (cd gcc-4.2.2 && bzcat -dc ../gcc-4.2.2-or32-fp.patch.bz2 | patch -p1)
user@host:~/or32-build$ wget ftp://sources.redhat.com/pub/newlib/newlib-1.17.0.tar.gz
user@host:~/or32-build$ tar xzf newlib-1.17.0.tar.gz
user@host:~/or32-build$ wget ftp://ocuser:oc@orsoc.se/toolchain/newlib-1.17.0-or32.patch.bz2
user@host:~/or32-build$ (cd newlib-1.17.0 && bzcat -dc ../newlib-1.17.0-or32.patch.bz2 | patch -p1)
user@host:~/or32-build$ ln -s newlib-1.17.0/newlib gcc-4.2.2/newlib
user@host:~/or32-build$ ln -s newlib-1.17.0/libgloss gcc-4.2.2/libgloss
user@host:~/or32-build$ mkdir b-gcc && cd b-gcc
user@host:~/or32-build/b-gcc$ ../gcc-4.2.2/configure --target=or32-elf --prefix=/opt/or32-newlib --with-gnu-as --with-gnu-ld --disable-libssp --verbose --with-newlib --enable-languages=c
user@host:~/or32-build/b-gcc$ make all install

We'll also create a specs file for gcc. For more info check out the OpenRISC GNU toolchain newlib install guide.

user@host:~/or32-build/b-gcc$ or32-elf-gcc -dumpspecs > /opt/or32-newlib/lib/gcc/or32-elf/4.2.2/specs

Edit this file and change the *endfile and *link sections to look like the following:

*endfile:
-lor32 -lc -lgcc -lc -lor32

*link:
/opt/or32-newlib/or32-elf/lib/or32.ld

or1ksim

This is a little simpler, just one download and patch and install.

user@host:~/or32-build$ wget ftp://ocuser:oc@orsoc.se/toolchain/or1ksim-0.3.0.tar.bz2
user@host:~/or32-build$ tar xjf or1ksim-0.3.0.tar.bz2
user@host:~/or32-build$ wget ftp://ocuser:oc@orsoc.se/toolchain/or1ksim-0.3.0-fp-patch.bz2
user@host:~/or32-build$ cd or1ksim-0.3.0
user@host:~/or32-build/or1ksim-0.3.0$ bzcat -dc ../or1ksim-0.3.0-fp-patch.bz2 | patch -p1
user@host:~/or32-build/or1ksim-0.3.0$ ./configure --target=or32-elf --prefix=/opt/or32-newlib
user@host:~/or32-build/or1ksim-0.3.0$ make all install

Patching and running x264

Here I'll outline how to download, patch and run x264 in or1ksim.

I've created a patch based on git revision e381f6d of x264. Here's how to checkout the x264 repository and revert to that revision.

Obtain x264 sources and set revision

user@host:or32-x264$ git clone git://git.videolan.org/x264.git
user@host:or32-x264$ cd x264
user@host:or32-x264/x264$ git checkout e381f6d

You should see a message saying HEAD is now at e381f6d. We now want to get the patch and apply it.

If you already have the OpenCores h264 project repository checked out, just do an svn update in trunk. The patch is in a directory I've added called x264/patches. At the time of writing the latest patch file is called x264-e381f6d-or32-or1ksim-with-fp-1.0.patch

If you don't have the repository, check it out with:

user@host:or32-x264$ svn co http://opencores.org/ocsvn/oc-h264-encoder/oc-h264-encoder

Patch x264 sources

Apply the patch to the revision e381f6d x264 sources.

user@host:or32-x264/x264$ patch -p1

Configure and build x264

Use the following command to configure the patched x264. Ensure the path to the toolchain we built earlier is properly set up in your PATH variable.

user@host:or32-x264/x264$ ./configure --disable-avis-input --disable-mp4-output --disable-pthread --enable-debug --host=or32-linux --cross-prefix=or32-elf- --extra-cflags="-g -mhard-mul -mhard-div -mhard-float" --extra-ldflags="-Tlink.ld"

Now a simple make would do, but I've configured the Makefile to include some raw YUV data in a section of the resulting ELF that we run in the simulator, so we must specify where an h264 bytestream is. Download one of the sample CIF files from here: http://www.tkn.tu-berlin.de/research/evalvid/cif.html. In my example I'll download the Foreman video. This is then turned into YUV frames with ffmpeg, so ensure that is installed too.

user@host:or32-x264$ wget http://www.tkn.tu-berlin.de/research/evalvid/cif/foreman_cif.264

Specify the location of this file when calling make with the H264_VIDEO_FILE variable on the command line, or edit the Makefile and set this variable to the appropriate path.

user@host:or32-x264/x264$ H264_VIDEO_FILE=../foreman_cif.264 make

The make process will generate some YUV data, about 5MB big (30 frames of CIF 4:2:0) which will be linked into the resulting ELF.
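As a quick check on the "about 5MB" figure, here is a minimal sketch of the size arithmetic for planar YUV 4:2:0, taking CIF as 352x288. The function names are made up for illustration and are not part of x264 or the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one planar YUV 4:2:0 frame: a full-resolution luma plane plus
 * two chroma planes, each subsampled by two horizontally and vertically. */
size_t yuv420_frame_bytes(size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0); /* 4:2:0 needs even dimensions */
    size_t luma = width * height;
    size_t chroma = (width / 2) * (height / 2);
    return luma + 2 * chroma;
}

/* Total bytes for a clip of num_frames frames of the given size. */
size_t yuv420_clip_bytes(size_t width, size_t height, size_t num_frames)
{
    return yuv420_frame_bytes(width, height) * num_frames;
}
```

For 30 CIF frames, yuv420_clip_bytes(352, 288, 30) comes to 4,561,920 bytes, which is the amount of raw data linked into the ELF.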

The build also pre-calculates a large array of values. x264 used to take a long time computing this in software each time it started, so I've created a bash script which generates the array and saves it in C format. This can still take a long time, but it only gets done once, and at the speed of your host computer, not at the speed of a simulated OR1k. Even so, you'll notice that when x264 is run it still spends a number of seconds at the beginning in the function x264_analyse_init_costs(), generating a large array for each value of lambda. If anyone is aware of a way around this, please let us know so we can make this step more efficient.

Run x264 in or1ksim

Finally! We get to run x264 in or1ksim.

There is a rule to make and run the executable in or1ksim, just do make sim, but to do it manually you can do the following:

user@host:or32-x264/x264$ or32-elf-sim -f or1ksim_x264.cfg x264

This is using the included or1ksim configuration file, or1ksim_x264.cfg

You should see or1ksim boot, then the application start. There's some debugging output still in there, but then you'll see a line for each frame it encodes, before it finishes and prints out the stats of the encode.

Configuring x264

Since we're running this standalone, the encoding configuration is hardcoded in the application. I currently have it set to the lowest quality, the ultrafast preset.

Look in x264.c in the main() function where the param struct is being altered; this determines what the encoder does. The comments indicate which set of assignments is doing what. I took the ultrafast preset settings from the Parse() function. Look there for other presets if you're interested in playing around with it.

If changing the resolution of the file (from CIF to QCIF, or even larger), be sure to set the param.i_width and param.i_height parameters here too. Perhaps these should be set with a define or something.

Why is this useful?

This gives us a platform to now easily investigate potential acceleration blocks by implementing them as a C model first in or1ksim. I should think we'd develop the OpenRISC port of the x264 software as we go, resulting in software that should be ready to run on the hardware once the accelerator blocks are finalised and tested in the architectural sim. Of course, once the models of the accelerators are finished the tricky part is coding the RTL, however I think once you have a good idea of what the block should do from a C model a lot of the tedium of dealing with HDL compilers and slow simulations is eliminated.

Why bother with floating point for the or1200 - we're just going to add acceleration blocks anyway, right?

I found the simulation, when using the software floating point libraries, to be incredibly slow. When I finally got single precision floating point stuff enabled it went a lot faster. We can't do double precision floating point on the 32-bit incarnation of the OpenRISC, that's a 64-bit only thing unfortunately.

As we slowly offload the bulk of the calculation onto accelerator blocks and the CPU does less and less floating point, perhaps we'll switch it off, saving some hardware, but right now in the architectural simulator it makes things a lot faster.

To do

There are many things to do regarding this software model.

  • or32-elf-profile crashes when parsing the generated profiling output from or1ksim. Fix this.
  • Figure out the easiest way to build memory-mapped modules for or1ksim.
  • Perhaps make most of x264's parameters definable by -DDEFINES or other things at compile time, instead of editing the C-code (although this is not a complete pain).
  • Validate what the software in the simulator is doing: somehow dump the data out of or1ksim so we can ensure it is generating correct h264 bytestreams (perhaps already do-able with GDB)
  • Tune or1ksim config file to generate somewhat accurate cycle counts
  • Configure the timer or1ksim better so the application has a better sense of time.
  • Maybe transfer this thing to a page somewhere instead of a forum post

I hope this is useful.

Julius

RE: x264 in or1ksim
by gil_savir on Oct 27, 2009
gil_savir
Posts: 59
Joined: Dec 7, 2008
Last seen: May 10, 2021
great job, Julius!!!

should we develop the rtl on ORPSOC (in openRISC 1000 project page) environment?
RE: x264 in or1ksim
by julius on Oct 28, 2009

should we develop the rtl on ORPSOC environment?


Yes, I think we'll take a lot of what ORPSoC has, in terms of scripts, benches and RTL and setup a copy in the h264 project repository.

We'll probably want to replace the arbiter with something we can alter because at the moment it's just some synthesised netlist.

Julius
RE: x264 in or1ksim
by ethanli on Nov 3, 2009
ethanli
Posts: 9
Joined: Sep 19, 2008
Last seen: Apr 27, 2012
Hi,

I followed the instruction to build x264. Everything seems good but the following

Edit this file and change the *endfile and *link sections to look like the following:

*endfile:
-lor32 -lc -lgcc -lc -lor32

*link:
/opt/or32-newlib/or32-elf/lib/or32.ld

Question 1:
What is the purpose for these two parts?

Question 2:
I didn't find "or32.ld" in my build directory. Does this linker script differ from the default linker script?

Thanks,

Ethan
RE: x264 in or1ksim
by julius on Nov 3, 2009
Question 1:
What is the purpose for these two parts?



From the newlib part of the OpenRISC GNU toolchain page: http://opencores.org/openrisc,gnu_toolchain#newlib :

The *endfile section specifies object files to include at the end of the link command. Here we specify certain libraries and the order in which they should be used. They are in this order, from left to right, because libor32 requires things from the newlib libc, which requires things from the inbuilt libgcc, which ultimately needs a couple of calls (sbrk, exit) from libor32.

The *link section is for passing options to the linker, but we use it to simply specify which linking script to use.

Question 2:
I didn't find "or32.ld" in my build directory. Does this linker script differ from the default linker script?


When you do "make install" of gcc that was built with newlib, this linker script should get installed automatically to $(PATH_YOU_INSTALLED_THE_TOOLS_TO)/or32-elf/lib . The file is located in the newlib source at newlib-1.17.0/libgloss/or32/or32.ld . If you alter any of the or32 newlib port's files and do a new "make install" this script in newlib-1.17.0/libgloss/or32/or32.ld gets re-written to wherever it got installed to. If you alter it in the install path, be wary of this.

Julius
RE: x264 in or1ksim
by ethanli on Nov 12, 2009
Thanks for your reply.

After several retries, finally I got everything working. There are several tricks for me.

1. ln -s newlib-1.17.0/libgloss gcc-4.2.2/libgloss
ln -s newlib-1.17.0/newlib gcc-4.2.2/newlib
These commands are not working for me. I have to do,
cd newlib-1.17.0
ln -s libgloss ../gcc-4.2.2/libgloss
ln -s newlib ../gcc-4.2.2/newlib

2. Install ffmpeg without root privilege
follow the link http://dev.gemin-i.org/wiki/index.php/Ffmpeg_install_instructions
and run the following configure
CFLAGS="-L/home/ethan/lame-398-2/lib -I/home/yili/lame-398-2/include" LDFLAGS="-L/home/ethan/lame-398-2/lib" ./configure --enable-libmp3lame --enable-libvorbis --disable-mmx --enable-shared --disable-demuxer=v4l --disable-demuxer=v4l2 --disable-indev=v4l --disable-indev=v4l2 --enable-cross-compile

So far, I still don't know if my build is correct or not.

Can you tell me how to output the encoded stream into a file? Then I can check whether my build is correct. Thanks,
RE: x264 in or1ksim
by julius on Nov 13, 2009
Can you tell me how to output the encoded stream into a file?

I'll put up a new patch for x264 this weekend - I have put in printf at the end which lets the user know how much was written to memory, and where. From GDB you can then do a binary dump out to a file and check it. I'll do up a post with info in a day or so. The ability to verify that what we're doing isn't breaking or altering the output is important, so perhaps I'll find a way to checksum the encoder's output while it's in memory rather than doing this manual dump.
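A minimal sketch of the checksum idea, assuming the encoded bytestream sits in one contiguous buffer in memory. CRC-32 is chosen here purely for illustration; nothing in the patch uses it, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, as used by zlib and
 * Ethernet). Slow but table-free, which keeps a standalone build small. */
uint32_t crc32_buf(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones when the low bit is set, else zero */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}
```

Printing this value at the end of an encode and comparing it against the same CRC computed over a known-good file on the host would flag any change in the output without a manual dump.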

Julius

RE: x264 in or1ksim
by julius on Nov 15, 2009

I've put up a new patch, x264-e381f6d-or32-or1ksim-with-fp-1.1.patch, in trunk/x264/patches. It's just a little update, printing out the location of the data when it finishes encoding. Other changes include the configuration now using the fast preset with no lookahead buffer (more appropriate for a streaming application) and a VBV buffer only 4 frames big (I think this is useful?!)
I'll outline how to setup x264, compile, then run it in or1ksim, connect GDB, do the encode and then dump the encoded data out of the sim.

Patch x264 sources

Apply the patch to the revision e381f6d x264 sources. (See the original post for information about obtaining the source and setting the x264 revision.)

user@host:or32-x264/x264$ patch -p1

Configure and build x264

Use the following command to configure the patched x264. Ensure the path to the toolchain we built earlier is properly set up in your PATH variable.

user@host:or32-x264/x264$ ./configure --disable-avis-input --disable-mp4-output --disable-pthread --enable-debug --host=or32-linux --cross-prefix=or32-elf- --extra-cflags="-g -mhard-mul -mhard-div -mhard-float" --extra-ldflags="-Tlink.ld"

Now a simple make would do, but I've configured the Makefile to include some raw YUV data in a section of the resulting ELF that we run in the simulator, so we must specify where some sample video data is. Download one of the sample CIF files from here: http://www.tkn.tu-berlin.de/research/evalvid/cif.html. In my example I'll download the Foreman video. This is then turned into YUV frames with ffmpeg, so ensure that is installed too.

user@host:or32-x264$ wget http://www.tkn.tu-berlin.de/research/evalvid/cif/foreman_cif.264

Specify the location of this file when calling make with the INPUT_VIDEO_FILE variable on the command line, or edit the Makefile and set this variable to the appropriate path.

user@host:or32-x264/x264$ INPUT_VIDEO_FILE=../foreman_cif.264 make

The make process will generate some YUV data, about 5MB big (30 frames of CIF 4:2:0 YUV) which will be linked into the resulting ELF.

The build also pre-calculates a large array of values. x264 used to take a long time computing this in software each time it started, so I've created a bash script which generates the array and saves it in C format. This can still take a long time, but it only gets done once, and at the speed of your host computer, not at the speed of a simulated OR1k. Even so, you'll notice that when x264 is run it still spends a number of seconds at the beginning in the function x264_analyse_init_costs(), generating a large array for each value of lambda. If anyone is aware of a way around this, please let us know so we can make this step more efficient.

Run x264 in or1ksim and connect with GDB

Finally! We get to run x264 in or1ksim.

There is a rule to make and run the executable in or1ksim, just do make sim, but to do it manually you can do the following:

user@host:or32-x264/x264$ or32-elf-sim -f rsp_or1ksim_x264.cfg x264

This is using the included or1ksim configuration file, rsp_or1ksim_x264.cfg, which starts the simulator and then waits for GDB to connect.

You should see or1ksim boot, then wait for a connection from the debugger. I've hardcoded or1ksim to listen for GDB on port 5554, but you can change this in the or1ksim config file.

In another console window we want to start up GDB. Run the OpenRISC-compatible GDB, which is usually built and installed with the OpenRISC GNU toolchain port. It may live in another OpenRISC toolchain path; it's fine to copy it to your other toolchain path's bin directory, or to call it directly.

user@host:or32-x264/x264$ or32-elf-gdb x264

Now you'll be at the GDB prompt. The following list of commands connects to the or1ksim, and sets it running to the end of the main() function, where it will break and we can then dump the file.

(gdb) target remote localhost:5554
Remote debugging using localhost:5554
0x00000100 in ?? ()
(gdb) break main
Breakpoint 1 at 0x2760: file x264.c, line 116.
(gdb) c
Continuing.

Breakpoint 1, main (argc=0, argv=0x0) at x264.c:116
116 x264_param_default( &param );
(gdb) finish
Run till exit from #0 main (argc=0, argv=0x0) at x264.c:116
0x00002098 in loop ()
Value returned is $1 = 0

Or1ksim should have printed out a line like the following:

close_file_bsf: wrote 81115 bytes from 0x01d59c00 to 0x01d6d8db

This indicates where the encoded video bytestream is in the system. We'll dump this out with the following command in GDB:

(gdb) dump binary memory sim_dump.264 0x01d59c00 0x01d6d8db

This will create the file sim_dump.264 containing the contents of the memory boundaries we gave.
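The two addresses printed by close_file_bsf differ by exactly the reported byte count, so the second one is one past the last byte written, and GDB's dump likewise stops just short of its end address. A small sketch of that arithmetic (the helper name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Number of bytes in the half-open range [start, end), i.e. with the
 * end address treated as one past the last byte. */
uint32_t region_bytes(uint32_t start, uint32_t end)
{
    assert(end >= start);
    return end - start;
}
```

With the addresses above, region_bytes(0x01d59c00, 0x01d6d8db) is 81115, so sim_dump.264 should be 81115 bytes long on disk.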

Just tell GDB to continue once more and the code will make the simulator finish and shut down.

(gdb) c
Continuing.
Remote connection closed
(gdb) q

The video can be played with ffplay like so:

user@host:or32-x264/x264$ ffplay -s cif sim_dump.264

It's only thirty frames, but it confirms that or1ksim is doing its job.

To do

  • Figure out the easiest way to build memory-mapped modules for or1ksim (although I've mostly done this. Next post will be on this I think.)
  • Tune or1ksim config file to generate somewhat accurate cycle counts
  • Configure the timer or1ksim better so the application has a better sense of time.
  • Maybe transfer this thing to a page somewhere instead of a forum post

Julius

RE: x264 in or1ksim
by julius on Nov 17, 2009

Example hardware module model in or1ksim

I've taken the SAD and SSD loops from the x264 software and implemented them in "hardware". Performance wise I don't think this is particularly useful. I have done it just as an example of how to write a new module for or1ksim, the OpenRISC architectural simulator, and then use it from the x264 software. An attempt was made to make the performance of the module somewhat representative of what it might be like in a hardware implementation, by trying to provide accurate cycle counts of its computation.

New patches

There's a couple new patches up in the repository. One is a patch for or1ksim-0.3.0, implementing the example module, and one is a patch for x264, putting in code which uses the new module. Check out the repository, and the READMEs in the respective patch directories for or1ksim and x264.

Example SAD/SSD module

I found the SAD and SSD algorithms to be very simple, and similar, so I thought they were a good choice to implement in a single module.

Register interface

The functions in the C code (common/pixel.c) are generated with defines, one for each of SAD and SSD, and for each block size of 16x16,16x8,8x16,8x8,8x4,4x8 and 4x4. The function is passed pointers to the two sets of block data (top left pixel) and the strides to the next row. The functions then return a single integer value, the SAD or SSD value for that particular block.
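For reference, a minimal sketch of what functions of this shape compute, written as plain loops with the block size passed in rather than generated per size by macros as x264 does. The names block_sad and block_ssd are illustrative only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences over a w x h block. Each pointer addresses
 * the top-left pixel; each stride is the distance in bytes to the next row. */
int block_sad(const uint8_t *pix1, int stride1,
              const uint8_t *pix2, int stride2, int w, int h)
{
    assert(w > 0 && h > 0);
    int sum = 0;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            sum += abs(pix1[x] - pix2[x]);
        pix1 += stride1;
        pix2 += stride2;
    }
    return sum;
}

/* Sum of squared differences over the same kind of block. */
int block_ssd(const uint8_t *pix1, int stride1,
              const uint8_t *pix2, int stride2, int w, int h)
{
    assert(w > 0 && h > 0);
    int sum = 0;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int d = pix1[x] - pix2[x];
            sum += d * d;
        }
        pix1 += stride1;
        pix2 += stride2;
    }
    return sum;
}
```

The two functions differ only in the per-pixel operation, which is what makes them easy to fold into one module with a mode bit.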

The register interface is very simple. A register each for the function parameters, plus one for control, plus one for returning the result. I used the following struct:

typedef struct {
   uint32_t control;
   uint32_t result;
   uint32_t pix_ptr1;
   uint32_t pix_ptr2;
   uint32_t stride1;
   uint32_t stride2;
   uint32_t x;
   uint32_t y;
} x264_or32_sadssdmod_regs_t;

One bit in the control register selects SAD or SSD, and another bit indicates when the module should start processing. This start/busy bit is cleared by the module when it is finished. The software, after setting this bit, should poll it until it goes low. A very simple interface.

The result is stored in the result register, which the software reads after the busy bit goes low in the control register.
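A minimal sketch of the software side of this handshake, using the struct above. The bit positions, the meaning of x and y (taken here as block width and height), the polling hook and the stand-in module are all assumptions for illustration; the real code is in common/pixel.c in the patched sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
   uint32_t control;
   uint32_t result;
   uint32_t pix_ptr1;
   uint32_t pix_ptr2;
   uint32_t stride1;
   uint32_t stride2;
   uint32_t x;
   uint32_t y;
} x264_or32_sadssdmod_regs_t;

/* Assumed bit assignments; the post does not give them. */
#define SADSSD_CTRL_BUSY (1u << 0) /* start/busy */
#define SADSSD_CTRL_SSD  (1u << 1) /* 0 = SAD, 1 = SSD */

/* Hypothetical hook called while polling. On the target it would be NULL;
 * on a host it lets a stand-in play the part of the module. */
typedef void (*sadssd_poll_hook_t)(volatile x264_or32_sadssdmod_regs_t *regs);

uint32_t sadssd_run(volatile x264_or32_sadssdmod_regs_t *regs,
                    uint32_t pix1, uint32_t stride1,
                    uint32_t pix2, uint32_t stride2,
                    uint32_t x, uint32_t y, int do_ssd,
                    sadssd_poll_hook_t hook)
{
    assert(regs != NULL);
    regs->pix_ptr1 = pix1;
    regs->stride1 = stride1;
    regs->pix_ptr2 = pix2;
    regs->stride2 = stride2;
    regs->x = x;
    regs->y = y;
    /* Write control last: setting the start/busy bit kicks off the job. */
    regs->control = SADSSD_CTRL_BUSY | (do_ssd ? SADSSD_CTRL_SSD : 0u);
    /* Poll until the module clears the bit. */
    while (regs->control & SADSSD_CTRL_BUSY) {
        if (hook)
            hook(regs);
    }
    return regs->result;
}

/* Host-only stand-in: finishes instantly and reports a dummy value. */
void sadssd_fake_module(volatile x264_or32_sadssdmod_regs_t *regs)
{
    regs->result = regs->x * regs->y; /* placeholder, not a real SAD */
    regs->control &= ~SADSSD_CTRL_BUSY;
}
```

On the simulated target the regs pointer would be the module's base address from the or1ksim config file and the hook would be NULL, leaving a plain busy-wait loop.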

The patch implementing this in the software (x264-e381f6d-or32-or1ksim-with-fp-1.2.patch) modifies the defined SAD/SSD functions to use this module instead of doing it on the processor. See the file common/pixel.c in the patched x264 sources. The define OR32_SADSSDMOD (defined in common/or32/or32.h) controls whether this module is used or not.

Implementing SAD/SSD module in or1ksim

Here we look at building this simple module in or1ksim.

How or1ksim works, setting up the module

I hadn't worked too much with or1ksim before, and wasn't aware of the internal structure. It turns out it's very straight forward and easy to use.

A good example of a simple generic module is described in peripheral/generic.c in the or1ksim source.

Each module provides a few required, and many optional, functions. First is some sort of constructor (generic_sec_start() in peripheral/generic.c) which is called when or1ksim parses the config file and notices the start of a new section, as well as a function to properly instantiate the module when it has finished parsing the config file (generic_sec_end()). Others include some functions allowing access to configurable parameters (generic_name(), generic_size(), generic_baseaddr() etc.), and a function which registers these constructor/setter/getter functions so the main simulator knows how to call them (reg_generic_sec()). I'm using object-orientated terms here, but or1ksim is written entirely in C, not C++.

This function which registers the module's own functions must then be inserted into the simulator's startup routine, where it expects such registering of functions to be called. This function is reg_config_specs() in sim-config.c in or1ksim. You'll notice all other modules and peripherals have their section functions included there too.

So that's all good, but we're mainly interested in being memory-mapped accessible from the processor, and being able to access the system memory from our module.

Memory accesses

The memory subsystem of or1ksim requires each new module to provide it with information about which kind of accesses it supports (read and or write, 8, 16 and 32-bit wide accesses), its base address and its address size/space. In the function generic_sec_end() you can see that it checks which capabilities have been enabled (access size configurable only, not ro/rw/wo) and sets up a struct mem_ops variable accordingly. It then passes this struct, along with the module's base address and size (span of addresses) to the memory management system via the reg_mem_area() function. The simulator then knows where, how wide and what kind of accesses the module supports, making it accessible from the processor.

For memory accesses from the module to the rest of the system, the simulator provides a set of functions for reading and writing each of 8, 16 and 32-bit wide values. The read functions are eval_direct8(), eval_direct16() and eval_direct32(). The write functions are set_direct8(), set_direct16(), and set_direct32(). The first parameter is the address, the second for write functions is the value, and the last two of each are whether to go through the cache or MMU, used for statistical purposes only, so just ignore them in this case.

Setting up the SAD/SSD module

In the new patch for or1ksim implementing this module (to be applied on top of the existing or1ksim patch enabling floating point capability) this module's code can be found in the video_enc/x264_sadssdmod.c file.

I chose only to have full 32-bit word read/write capability, and the only options which are configurable by the config file at runtime are the module's base address, name and whether or not it's enabled.

I've declared the same struct for its registers as in x264 at the top of this file, and made them accessible via the x264_sadssdmod_read_word() and x264_sadssmod_write_word() functions.

How do you get it to do things?

Along with registering the module's configuration functions and its memory settings, you also register a reset function. This reset function, x264_sadssdmod_reset() in this case, is called once before simulation begins.

With the use of the SCHED_ADD(*function, void* data, int num_cycles) macro, we can schedule for a particular function to be called and passed a pointer to some data after num_cycles simulated clock cycles.

Using this feature we can then insert a hook for our function which does stuff. In this case, the stuff is the desired behavior of the module: monitor the module's registers, and react whenever the busy bit is set in the control register, performing either SAD or SSD calculations for the given parameters, leaving the calculated value in the result register.

The job function is registered like so: SCHED_ADD (x264_sadssdmod_job, dat,1) in the reset function, and is then called on the next cycle.

Implementing the SAD/SSD algorithm

The algorithm is very simple, in the case of SAD we accumulate the difference between each pixel in the current and reference block, and in SSD we square this difference and accumulate that instead.

Although it's not perfect, an attempt was made to code the module like an FSM in hardware. I initialise the module so that the job function is called each clock cycle. I've tried to make it do approximately one clock cycle's worth of work, before re-scheduling itself and returning.

All we're really doing once we get the go signal is a couple of for loops, so it was relatively easy to implement. So for each cycle where the module is activated and processing, I do one step of the algorithm. The following is the step for the SAD algorithm:

dev->regs->result +=
    abs(eval_direct8 (dev->state.pix_ptr1 + dev->state.x_count, 0, 0) -
    eval_direct8 (dev->state.pix_ptr2 + dev->state.x_count, 0, 0));

Some state variables, updated each cycle, allow us to track where we are in the loops. You can see here, the x_count state variable holds how far through the inner for loop we are.

It's probably not accurate to say this could occur in a single cycle. It would probably take several, due to the two memory accesses, the subtraction, the absolute value calculation and the addition/accumulation. This can be tuned by changing the number of cycles we schedule before calling the function again.

Finally when the for loops are complete, the last thing the module does is clear the busy bit. The processor then reads this at some point after that, and reads the result register. The module returns to polling the busy bit, waiting for it to be asserted again.
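A self-contained sketch of this one-step-per-tick structure, under stated assumptions: mem_read8() and the sim_mem array stand in for eval_direct8() and system memory, the loop in sadssd_run_until_idle() stands in for re-scheduling with SCHED_ADD, and the bit positions and state field names are made up. It is not the code from video_enc/x264_sadssdmod.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CTRL_BUSY (1u << 0) /* assumed bit position */
#define CTRL_SSD  (1u << 1) /* assumed bit position: 0 = SAD, 1 = SSD */

static uint8_t sim_mem[4096]; /* stand-in for simulated system memory */

static uint8_t mem_read8(uint32_t addr)
{
    assert(addr < sizeof sim_mem);
    return sim_mem[addr];
}

typedef struct {
    uint32_t control, result, pix_ptr1, pix_ptr2, stride1, stride2, x, y;
} regs_t;

typedef struct {
    regs_t regs;
    int active;                /* nonzero while a job is in progress */
    uint32_t row1, row2;       /* address of the current row in each block */
    uint32_t x_count, y_count; /* position within the two loops */
} sadssd_dev_t;

/* One call models one simulated clock tick of the module. */
void sadssd_tick(sadssd_dev_t *dev)
{
    if (!dev->active) {
        if (!(dev->regs.control & CTRL_BUSY))
            return; /* idle: keep watching the busy bit */
        if (dev->regs.x == 0 || dev->regs.y == 0) {
            dev->regs.result = 0; /* empty block: finish at once */
            dev->regs.control &= ~CTRL_BUSY;
            return;
        }
        /* Start of a job: clear the accumulator and latch the pointers. */
        dev->regs.result = 0;
        dev->row1 = dev->regs.pix_ptr1;
        dev->row2 = dev->regs.pix_ptr2;
        dev->x_count = dev->y_count = 0;
        dev->active = 1;
        return;
    }
    /* One pixel pair per tick. */
    int d = (int)mem_read8(dev->row1 + dev->x_count) -
            (int)mem_read8(dev->row2 + dev->x_count);
    dev->regs.result += (dev->regs.control & CTRL_SSD) ? (uint32_t)(d * d)
                                                       : (uint32_t)abs(d);
    if (++dev->x_count == dev->regs.x) { /* end of row */
        dev->x_count = 0;
        dev->row1 += dev->regs.stride1;
        dev->row2 += dev->regs.stride2;
        if (++dev->y_count == dev->regs.y) { /* end of block */
            dev->regs.control &= ~CTRL_BUSY;
            dev->active = 0;
        }
    }
}

/* Stand-in for the scheduler: tick until the job is done, return ticks used. */
unsigned sadssd_run_until_idle(sadssd_dev_t *dev)
{
    unsigned ticks = 0;
    do {
        sadssd_tick(dev);
        ticks++;
    } while (dev->active || (dev->regs.control & CTRL_BUSY));
    return ticks;
}
```

In this model a 4x4 block costs 17 ticks, one to latch the parameters and one per pixel pair. Charging more than one tick per pixel pair is the knob for making the cycle count less optimistic.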

Results

When encoding 5 frames of CIF, using the exact same simulator configuration, doing SAD/SSD on the processor takes 37,133,265,830 cycles, while with the hardware module the simulator reports only 28,249,742,100 cycles, or about 24% fewer.

I do admit, though, that this is optimistic, as the hardware module model doesn't accurately represent the cycles taken for the step where it does the actual calculation.

Plus I think there's some tuning of the simulator as a whole to be done here. Using those cycle numbers, a processor running at 100 MHz would take about 371 seconds to encode just 5 frames. I'm yet to run this in hardware, but I don't think that is right. There is one annoying thing, though: the calculation of a big cost vector in the x264_analyse_init_costs() function at the beginning, which I think takes more than half the time.
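Working the quoted cycle counts through, as a sketch with hypothetical helper names: at an assumed 100 MHz clock, 37,133,265,830 cycles is roughly 371 seconds, and the drop to 28,249,742,100 cycles is a saving of a little under 24%.

```c
#include <assert.h>
#include <stdint.h>

/* Wall-clock time for a cycle count at a given clock frequency in Hz. */
double cycles_to_seconds(uint64_t cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)cycles / clock_hz;
}

/* Percentage of cycles saved going from `before` to `after`. */
double percent_saved(uint64_t before, uint64_t after)
{
    assert(before > 0 && after <= before);
    return 100.0 * (double)(before - after) / (double)before;
}
```

Calling cycles_to_seconds(37133265830ULL, 100e6) and percent_saved(37133265830ULL, 28249742100ULL) gives those two figures.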

Other or1ksim modules

The main idea of this was to show how to use or1ksim to model any potential hardware modules implemented to speedup x264. They can be processor independent and interfaced via a simple method. Interrupts could also be used for multiple modules running in parallel, for instance.

I think some useful tests of potential sw/hw partitioning can be done using this method. The tuning of the hardware module is important to get somewhat accurate performance impact results.

I hope this is useful.

Julius

RE: x264 in or1ksim
by ethanli on Nov 20, 2009
I tried patch 1.1, connected with GDB and ran x264, but I got an odd error.

Listening for RSP on port 5554
Remote debugging from host 0.0.0.0
get_frame_total_yuv: 352 288 4561920 30
x264 [info]: using cpu capabilities: none!
x264_analyse_init_costs: lambda (= x264_lambda_tab[10]) = 1
x264_analyse_init_costs: lambda (= x264_lambda_tab[16]) = 2
x264_analyse_init_costs: lambda (= x264_lambda_tab[20]) = 3
x264_analyse_init_costs: lambda (= x264_lambda_tab[23]) = 4
x264_analyse_init_costs: lambda (= x264_lambda_tab[26]) = 5
x264_analyse_init_costs: lambda (= x264_lambda_tab[27]) = 6
x264_analyse_init_costs: lambda (= x264_lambda_tab[29]) = 7
x264_analyse_init_costs: lambda (= x264_lambda_tab[30]) = 8
x264_analyse_init_costs: lambda (= x264_lambda_tab[31]) = 9
x264_analyse_init_costs: lambda (= x264_lambda_tab[32]) = 10
x264_analyse_init_costs: lambda (= x264_lambda_tab[33]) = 11
x264_analyse_init_costs: lambda (= x264_lambda_tab[34]) = 13
x264_analyse_init_costs: lambda (= x264_lambda_tab[35]) = 14
x264_analyse_init_costs: lambda (= x264_lambda_tab[36]) = 16
x264_analyse_init_costs: lambda (= x264_lambda_tab[37]) = 18
x264_analyse_init_costs: lambda (= x264_lambda_tab[38]) = 20
x264_analyse_init_costs: lambda (= x264_lambda_tab[39]) = 23
x264_analyse_init_costs: lambda (= x264_lambda_tab[40]) = 25
x264_analyse_init_costs: lambda (= x264_lambda_tab[41]) = 29
x264_analyse_init_costs: lambda (= x264_lambda_tab[42]) = 32
x264_analyse_init_costs: lambda (= x264_lambda_tab[43]) = 36
x264_analyse_init_costs: lambda (= x264_lambda_tab[44]) = 40
x264_analyse_init_costs: lambda (= x264_lambda_tab[45]) = 45
x264_analyse_init_costs: lambda (= x264_lambda_tab[46]) = 51
x264_analyse_init_costs: lambda (= x264_lambda_tab[47]) = 57
x264_analyse_init_costs: lambda (= x264_lambda_tab[48]) = 64
x264_analyse_init_costs: lambda (= x264_lambda_tab[49]) = 72
x264_analyse_init_costs: lambda (= x264_lambda_tab[50]) = 81
x264_analyse_init_costs: lambda (= x264_lambda_tab[51]) = 91
x264 [debug]: VBV maxrate unspecified, assuming CBR
x264 [info]: profile Baseline, level 3.0
x264 [debug]: frame= 0 QP=27.04 NAL=3 Slice:I Poc:0 I:396 P:0 SKIP:0 size=8996 bytes
x264 [debug]: frame= 1 QP=30.75 NAL=2 Slice:P Poc:2 I:3 P:195 SKIP:198 size=626 bytes
x264 [error]: malloc of size 585728 failed
x264 [error]: x264_encoder_encode failed

I don't believe my server ran out of memory. It only needs around 600K. Someone said ffmpeg causes this problem.
RE: x264 in or1ksim
by julius on Nov 20, 2009
julius
Posts: 363
Joined: Jul 1, 2008
Last seen: May 17, 2021
I don't believe my server ran out of memory.

Don't worry, it's the or1ksim's simulated memory that is running short here, not your actual system.

How big is the yuv_data.elf that gets generated? Maybe for some reason it's too big and using up all the RAM in the system? Although I don't think this is possible due to the linker script defining where it can go.

Are there any other modifications you've done to the x264 code or the code the patch adds? Are you sure you're running or1ksim and specifying the right simulation file with the -f option?

Julius
RE: x264 in or1ksim
by ethanli on Nov 23, 2009
ethanli
Posts: 9
Joined: Sep 19, 2008
Last seen: Apr 27, 2012

How big is the yuv_data.elf that gets generated? Maybe for some reason it's too big and using up all the RAM in the system? Although I don't think this is possible due to the linker script defining where it can go.


4562381 Nov 23 14:49 yuv_data.elf


Are there any other modifications you've done to the x264 code or the code the patch adds? Are you sure you're running or1ksim and specifying the right simulation file with the -f option?


No. I followed the post exactly. But or32-elf-gdb is from my uclibc toolchain, since I didn't find it in my newlib one. It is only a client, so it should probably be OK.

Even after I changed the config to increase the memory size, I get the same problem. From my debugging, the error really does come from malloc.
RE: x264 in or1ksim
by kahomike on Nov 29, 2009
kahomike
Posts: 4
Joined: Aug 22, 2009
Last seen: Dec 14, 2009
Dear Julius:

Thanks a lot for your great work!

I can compile and run x264 in or1ksim. However, whenever I want to increase the number of reference frames used, e.g. when I set param.i_frame_reference = 2; in x264.c, I get an error and the encoding ends:

x264 [error]: malloc of size 585728 failed
x264 [error]: x264_encoder_encode failed
exit(-1)

May I ask how this problem can be solved?

Thanks a lot if you can help.

Regards,
Mike



RE: x264 in or1ksim
by julius on Nov 29, 2009
julius
Posts: 363
Joined: Jul 1, 2008
Last seen: May 17, 2021
May I ask how this problem can be solved?

This is a good question. I am not seeing this problem. It would be great if you could show me the commands you used to compile x264 and run it. Perhaps you could put this in a text file and attach it to a post (or using pastebin etc.), rather than pasting it all into a post!

I have been doing further work and hope to have a new patch out soon which will have some very handy features (exact memory accesses, instructions executed etc.) and a better setup. Very soon (within a few days) I aim to get some solid numbers on what kind of performance increase, in cycles, we need to achieve to make this system realisable on FPGA.

But if you could post and attach a log of the compilation and execution of x264 in or1ksim that'd be great. From memory, the heap variable is in the sbrk() function, which is part of newlib, and it sounds to me like there's not enough memory being allocated by the linker. It's strange that I'm not seeing this. But sometimes when creating the patches I make a mistake and leave out a file, such as a linker script, which results in subtle and hard-to-trace problems.
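To illustrate the kind of failure I mean, here is a simplified sbrk-style allocator in C. It is not newlib's actual code: the names `demo_sbrk`, `demo_heap` and `DEMO_HEAP_SIZE` are hypothetical, the static array stands in for a heap region fixed at link time, and failure is signalled with NULL for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the heap region a linker script would reserve.
   The size and names here are hypothetical, chosen for the demo. */
#define DEMO_HEAP_SIZE 1024u
static uint8_t demo_heap[DEMO_HEAP_SIZE];
static size_t  demo_brk = 0;   /* offset of the current break */

/* Minimal sbrk-style allocator: grow the break by `incr` bytes and
   return the old break, or NULL if the fixed region is exhausted. */
void *demo_sbrk(size_t incr)
{
    assert(demo_brk <= DEMO_HEAP_SIZE);
    if (incr > DEMO_HEAP_SIZE - demo_brk)
        return NULL;           /* heap limit hit: malloc would fail here */
    void *old = &demo_heap[demo_brk];
    demo_brk += incr;
    return old;
}
```

If the real heap limit is fixed in a similar way at link time, then giving the simulator more RAM in its config file would not by itself make a failing malloc succeed; the limit seen by sbrk() would have to move as well. That is a guess until the linker script and build log have been checked.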

Julius
RE: x264 in or1ksim
by kahomike on Nov 30, 2009
kahomike
Posts: 4
Joined: Aug 22, 2009
Last seen: Dec 14, 2009
Dear Julius:

Thanks for your prompt reply.

I built the toolchain and patched x264 exactly as in your post of Oct 27, 2009, i.e. I used the patch x264-e381f6d-or32-or1ksim-with-fp-1.0.patch

The attached compile_and_run.txt contains the output of compiling and running x264.

The other configs I modified in x264.c are:

param.analyse.intra = X264_ANALYSE_I8x8;
param.analyse.inter = X264_ANALYSE_PSUB16x16;
param.analyse.i_subpel_refine = 1;

but they are OK with param.i_frame_reference = 1 or 0;

All other param.i_frame_reference values lead to the same error:

x264 [error]: malloc of size 585728 failed

(as shown in compile_and_run.txt)

Other sequences, e.g. Akiyo and Stefan, have the same problem.

I also tried to increase the memory size by modifying the or1ksim_x264.cfg:

section memory
...
name = "RAM"
...
baseaddr = 0x00000000
size = 0x10000000 // original is size = 0x02000000
...
end

but the problem still exists.

Regards,
Mike